Starting a bit late, but very hopeful! I am really interested in the topic, but technical difficulties are driving me insane. Also very lost for now.
Let’s start by reading the dataset that we previously wrangled.
learning2014 <- read.csv("./data/learning2014.csv")
Let’s look at the data structure:
dim(learning2014)
## [1] 166 8
str(learning2014)
## 'data.frame': 166 obs. of 8 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
head(learning2014)
## X gender age attitude deep stra surf points
## 1 1 F 53 3.7 3.583333 3.375 2.583333 25
## 2 2 M 55 3.1 2.916667 2.750 3.166667 12
## 3 3 F 49 2.5 3.500000 3.625 2.250000 24
## 4 4 M 53 3.5 3.500000 3.125 2.250000 10
## 5 5 M 49 3.7 3.666667 3.625 2.833333 22
## 6 6 F 38 3.8 4.750000 3.625 2.416667 21
These data result from a survey conducted during an Introduction to Social Statistics course in the fall of 2014. This particular data frame contains responses from the 166 people who took the exam. There are 7 data entries for each participant: gender, age, attitude towards learning, questions related to deep, surface and strategic learning, as well as the exam grade.
summary(learning2014$gender)
## F M
## 110 56
110 women and 56 men undertook the survey.
summary(learning2014$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 17.00 21.00 22.00 25.51 27.00 55.00
Their age ranged from 17 to 55 years, with a mean of 25.51.
summary(learning2014$attitude)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.400 2.600 3.200 3.143 3.700 5.000
summary(learning2014$deep)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.583 3.333 3.667 3.680 4.083 4.917
summary(learning2014$stra)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.250 2.625 3.188 3.121 3.625 5.000
summary(learning2014$surf)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.583 2.417 2.833 2.787 3.167 4.333
Their attitude, as well as the questions pertaining to deep, surface and strategic learning, was measured on a scale from 1 to 5.
summary(learning2014$points)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.00 19.00 23.00 22.72 27.75 33.00
The exam grades varied from 7 to 33 points, with a mean of 22.72. The next step is the graphical exploration.
library(GGally)
## Loading required package: ggplot2
library(ggplot2)
pairs(learning2014[-1], col = learning2014$gender)
p <- ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
p
The plot demonstrates pairwise correlations between all the variables except gender. Gender is a binary variable and should not simply be correlated with continuous ones; here it is instead used to colour all of the paired plots for a more detailed visual analysis. We can also inspect the boxplots for each variable and analyze them visually. None of the variables contains an alarming number of outliers, except for age.
Before starting any analysis, it is good to look at the probability plots of each variable.
Options:
1. The qqnorm() function creates a quantile-quantile plot evaluating the fit of the sample data to the normal distribution.
2. The fitdistr() function in the MASS package provides maximum-likelihood fitting of univariate distributions. The format is fitdistr(x, densityfunction), where x is the sample data and densityfunction is one of the following: “beta”, “cauchy”, “chi-squared”, “exponential”, “f”, “gamma”, “geometric”, “log-normal”, “lognormal”, “logistic”, “negative binomial”, “normal”, “Poisson”, “t” or “weibull”.
3. For others, see: https://cran.r-project.org/doc/contrib/Ricci-distributions-en.pdf
If the distribution does not satisfy the normality criterion, log-transform the variable to identify outliers more clearly.
Delete those observations.
Try step 1 again.
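As a minimal sketch of options 1 and 2 (assuming a generic numeric vector standing in for one of the survey variables — the data here are simulated, not from learning2014):

```r
# simulate a sample to stand in for one of the survey variables
set.seed(42)
x <- rnorm(100, mean = 3, sd = 0.5)

# option 1: Q-Q plot against the normal distribution, with a reference line
qqnorm(x)
qqline(x)

# option 2: maximum-likelihood fit of a normal density with MASS
library(MASS)
fit <- fitdistr(x, "normal")
fit$estimate  # estimated mean and sd
```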
We are now interested in exploring whether we can explain the student exam grade by relating it to the other available variables. A multiple regression model is fitted with the response variable ‘points’ and the explanatory variables ‘attitude’, ‘stra’ and ‘surf’.
lm <- lm(points ~ attitude + stra + surf, data = learning2014)
lm
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
##
## Coefficients:
## (Intercept) attitude stra surf
## 11.0171 3.3952 0.8531 -0.5861
summary(lm)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
The model with three explanatory variables accounts for about 21% of the variance (multiple R-squared). It shows that the only variable with a statistically significant relationship with the exam grade is ‘attitude’. Next, we remove the two other factors one at a time in search of a more parsimonious model.
lm1 <- lm(points ~ attitude + stra, data = learning2014)
lm1
##
## Call:
## lm(formula = points ~ attitude + stra, data = learning2014)
##
## Coefficients:
## (Intercept) attitude stra
## 8.9729 3.4658 0.9137
summary(lm1)
##
## Call:
## lm(formula = points ~ attitude + stra, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6436 -3.3113 0.5575 3.7928 10.9295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.9729 2.3959 3.745 0.00025 ***
## attitude 3.4658 0.5652 6.132 6.31e-09 ***
## stra 0.9137 0.5345 1.709 0.08927 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared: 0.2048, Adjusted R-squared: 0.1951
## F-statistic: 20.99 on 2 and 163 DF, p-value: 7.734e-09
lm2 <- lm(points ~ attitude + surf, data = learning2014)
lm2
##
## Call:
## lm(formula = points ~ attitude + surf, data = learning2014)
##
## Coefficients:
## (Intercept) attitude surf
## 14.120 3.426 -0.779
summary(lm2)
##
## Call:
## lm(formula = points ~ attitude + surf, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.277 -3.236 0.386 3.977 10.642
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.1196 3.1271 4.515 1.21e-05 ***
## attitude 3.4264 0.5764 5.944 1.63e-08 ***
## surf -0.7790 0.7956 -0.979 0.329
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 163 degrees of freedom
## Multiple R-squared: 0.1953, Adjusted R-squared: 0.1854
## F-statistic: 19.78 on 2 and 163 DF, p-value: 2.041e-08
Taking out the ‘surf’ or ‘stra’ variables barely changed the multiple R-squared and thus the fit of the model. It is therefore safe to conclude that they do not affect the exam grade, and we can proceed with the only significant factor, ‘attitude’.
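The same conclusion can also be checked formally with an F-test between nested models via anova(). A self-contained sketch on simulated data (the variable names mirror the real ones, but the values here are made up, with only ‘attitude’ truly related to ‘points’):

```r
set.seed(1)
n <- 166
attitude <- rnorm(n, 3.1, 0.7)
stra <- rnorm(n, 3.1, 0.8)
surf <- rnorm(n, 2.8, 0.5)
points <- 11 + 3.5 * attitude + rnorm(n, 0, 5)  # only attitude matters

full <- lm(points ~ attitude + stra + surf)
reduced <- lm(points ~ attitude)

# F-test: does dropping 'stra' and 'surf' significantly worsen the fit?
anova(reduced, full)
```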
lm3 <- lm(points ~ attitude, data = learning2014)
lm3
##
## Call:
## lm(formula = points ~ attitude, data = learning2014)
##
## Coefficients:
## (Intercept) attitude
## 11.637 3.525
summary(lm3)
##
## Call:
## lm(formula = points ~ attitude, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
The fitted model explains about 19% of the variance, and attitude is concluded to be a statistically significant predictor of the exam grade.
Next, to make sure that the model is specified correctly, a residual analysis is due.
Residuals vs. Fitted values
plot(lm3, which = c(1))
QQ plot (theoretical quantiles)
plot(lm3, which = c(2))
Residuals vs. Leverage
plot(lm3, which = c(5))
The residuals vs. fitted values are evenly distributed and do not demonstrate any pattern. The QQ plot shows a good fit along the line with no cause for concern. The leverage plot does not show any particularly influential outliers.
The model is valid.
This dataset approaches student achievement in secondary education at two Portuguese schools. The data attributes include student grades and demographic, social and school-related features; the data were collected using school reports and questionnaires.
alc <- read.csv("./data/alc.csv")
The dataset contains 35 variables and 382 observations pertaining to the performance, family, personal life and other information about students and their consumption of alcohol.
Here, I am interested in studying the relationships between high/low alcohol consumption and other variables in the data, namely, four of them: ‘absences’ (number of school absences), ‘traveltime’ (home to school travel time), ‘goout’ (going out with friends), and ‘famrel’ (quality of family relationships).
I hypothesize that a high number of absences and a lot of time spent with friends, along with bad family relationships and a short travel time, will increase the odds of high alcohol consumption in students.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
##
## nasa
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
keep_columns <- c("sex", "age", "absences", "traveltime", "goout", "famrel", "high_use")
alc <- select(alc, one_of(keep_columns))
sex <- alc$sex
age <- alc$age
absences <- alc$absences
traveltime <- alc$traveltime
goout <- alc$goout
famrel <- alc$famrel
high_use <- alc$high_use
We have now reformatted the dataset to only keep the columns pertaining to the hypothesis and two general variables, sex and age. Now onto the graphical exploration.
library(ggplot2)
# create a dummy variable with the same mean and standard deviation, but normally distributed, for comparison purposes
absences_norm <- rnorm(200,mean=mean(absences, na.rm=TRUE), sd=sd(absences, na.rm=TRUE))
# plot the two variables side by side
boxplot(absences, absences_norm, main = "Boxplots of the absences variable: actual distribution vs. normal", at = c(1,2), names = c("absences","absences_norm"), las = 2, col = c("pink","yellow"), horizontal = TRUE)
First, we need to analyze the only truly numeric variable, the number of absences. As seen in the boxplot, the actual distribution is within normal limits (with no negative values) and has a few outliers. Before excluding those observations, we will explore the other variables and try to fit the model.
Now we will plot the interval variables with a scale of answers from 1 to 5.
boxplot(traveltime, goout, famrel, main = "Boxplots of interval variables", at = c(1,2,3), names = c("travel time","time with friends", "family relationships"), las = 2, col = c("blue","violet","green"))
The “goout” variable seems to be roughly normally distributed, although the two others are skewed in different directions. Next, we will plot each variable against the response variable and explore the dataset further.
library(GGally)
cross <- ggpairs(alc, mapping = aes(col = sex), lower = list(combo = wrap("facethist", bins = 20)))
cross
The variables have very low correlations between themselves, so there is little risk of multicollinearity tampering with the model. Now onto the model fitting.
m <- glm(high_use ~ absences + traveltime + goout + famrel, data = alc, family = "binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ absences + traveltime + goout + famrel,
## family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0083 -0.7805 -0.5437 0.8464 2.3984
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.97770 0.68176 -4.368 1.26e-05 ***
## absences 0.07559 0.02202 3.432 0.000598 ***
## traveltime 0.43957 0.17532 2.507 0.012167 *
## goout 0.76735 0.12087 6.348 2.18e-10 ***
## famrel -0.36294 0.13756 -2.638 0.008331 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 389.46 on 377 degrees of freedom
## AIC: 399.46
##
## Number of Fisher Scoring iterations: 4
All of the explanatory variables have a statistically significant relationship with the target variable.
Next, we will calculate the odds ratios and confidence intervals to validate our model.
odds_ratios <- coef(m) %>% exp
conf_int <- confint(m) %>% exp
## Waiting for profiling to be done...
cbind(odds_ratios, conf_int)
## odds_ratios 2.5 % 97.5 %
## (Intercept) 0.0509099 0.01285711 0.1879144
## absences 1.0785172 1.03437018 1.1291483
## traveltime 1.5520466 1.10198801 2.1968917
## goout 2.1540600 1.71012923 2.7497327
## famrel 0.6956280 0.52984131 0.9103098
Per the confidence intervals, the odds of high alcohol consumption:
1. increase by a factor of 1.03 to 1.13 with each additional absence;
2. increase by a factor of 1.10 to 2.20 with each step of longer travel time to school;
3. increase by a factor of 1.71 to 2.75 with each step of more time spent going out with friends;
4. for students with a good family situation are between 53% and 91% of the odds of students with bad family relationships, per one-step improvement in ‘famrel’.
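The link between a logistic-regression coefficient and an odds ratio is just exponentiation, exp(coef). A quick base-R sketch on simulated data (the names ‘goout’ and ‘high_use’ mirror the real variables, but the values here are synthetic):

```r
set.seed(7)
n <- 400
goout <- sample(1:5, n, replace = TRUE)
# true log-odds of high use increase by 0.7 per extra point of 'goout'
p <- plogis(-2 + 0.7 * goout)
high_use <- rbinom(n, 1, p)

m <- glm(high_use ~ goout, family = "binomial")
or <- exp(coef(m)[["goout"]])  # odds ratio per one-point increase of 'goout'
or
```

The fitted odds ratio should land near exp(0.7) ≈ 2, recovering the simulated effect.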
To continue evaluating the model, we will compare predicted and actual values.
# predict the probability of high_use
probabilities <- predict(m, type = "response")
alc <- mutate(alc, probability = probabilities)
# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = (probability > 0.5))
select(alc, absences, traveltime, goout, famrel, high_use, probability, prediction) %>% head(10)
## absences traveltime goout famrel high_use probability prediction
## 1 5 2 4 4 FALSE 0.47428353 FALSE
## 2 3 1 3 5 FALSE 0.13895457 FALSE
## 3 8 1 2 4 TRUE 0.13581670 FALSE
## 4 1 1 2 3 FALSE 0.11746601 FALSE
## 5 2 1 2 4 FALSE 0.09079210 FALSE
## 6 8 1 2 5 FALSE 0.09855191 FALSE
## 7 0 1 4 4 FALSE 0.28486278 FALSE
## 8 4 2 4 4 FALSE 0.45548223 FALSE
## 9 0 1 2 4 FALSE 0.07906088 FALSE
## 10 0 1 1 5 FALSE 0.02697575 FALSE
prediction_plot <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))
prediction_plot + geom_point()
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>% addmargins()
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.64921466 0.05235602 0.70157068
## TRUE 0.18586387 0.11256545 0.29842932
## Sum 0.83507853 0.16492147 1.00000000
Not totally amazing, but still. The penultimate step would be to calculate the accuracy of the model.
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# call loss_func to compute the average number of wrong predictions in the training data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2382199
Quite a good result!
Finally, we perform the 10-fold cross-validation.
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2513089
Sliiiiightly better than 0.26 by the DataCamp model.
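For reference, the cross-validation call follows the same pattern for any glm. A self-contained sketch with the boot package on simulated data (not the alc dataset):

```r
library(boot)
set.seed(123)
n <- 300
x <- rnorm(n)
y <- rbinom(n, 1, plogis(x))
d <- data.frame(x = x, y = y)

m <- glm(y ~ x, data = d, family = "binomial")

# loss: proportion of wrong predictions at the 0.5 threshold
loss_func <- function(class, prob) mean(abs(class - prob) > 0.5)

# 10-fold cross-validation; delta[1] is the raw CV estimate of the loss
cv <- cv.glm(data = d, cost = loss_func, glmfit = m, K = 10)
cv$delta[1]
```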
The first thing on today’s agenda is to load the Boston dataset from the MASS package in R.
# loading
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
data("Boston")
# initial dataset exploration
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
This dataset contains 506 observations of 14 variables, two of them integer-valued (‘chas’ and ‘rad’) and the rest numeric. The indices were measured across the following variables:
1. ‘crim’ (per capita crime rate by town)
2. ‘zn’ (proportion of residential land zoned for lots over 25,000 sq.ft)
3. ‘indus’ (proportion of non-retail business acres per town)
4. ‘chas’ (Charles River dummy variable (= 1 if tract bounds river; 0 otherwise))
5. ‘nox’ (nitrogen oxides concentration (parts per 10 million))
6. ‘rm’ (average number of rooms per dwelling)
7. ‘age’ (proportion of owner-occupied units built prior to 1940)
8. ‘dis’ (weighted mean of distances to five Boston employment centres)
9. ‘rad’ (index of accessibility to radial highways)
10. ‘tax’ (full-value property-tax rate per $10,000)
11. ‘ptratio’ (pupil-teacher ratio by town)
12. ‘black’ (1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town)
13. ‘lstat’ (lower status of the population (percent))
14. ‘medv’ (median value of owner-occupied homes in $1000s)
The variables have very different ranges, which probably calls for standardisation before the analysis. Now onto a graphical overview.
pairs(Boston)
A lot of variables seem to be strongly correlated. The plot for ‘nox’ and ‘dis’, for instance, suggests a 1/x-type relationship. This probably makes sense, since air pollution dissipates the farther you are from the city centre.
To check whether this is just an observation or a valid hypothesis, let’s check for correlations.
# calculate the correlation matrix
cor_matrix <- cor(Boston)
# round to 2 digits
cor_matrix_r <- round(cor_matrix, 2)
cor_matrix_r
## crim zn indus chas nox rm age dis rad tax
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47
## ptratio black lstat medv
## crim 0.29 -0.39 0.46 -0.39
## zn -0.39 0.18 -0.41 0.36
## indus 0.38 -0.36 0.60 -0.48
## chas -0.12 0.05 -0.05 0.18
## nox 0.19 -0.38 0.59 -0.43
## rm -0.36 0.13 -0.61 0.70
## age 0.26 -0.27 0.60 -0.38
## dis -0.23 0.29 -0.50 0.25
## rad 0.46 -0.44 0.49 -0.38
## tax 0.46 -0.44 0.54 -0.47
## ptratio 1.00 -0.18 0.37 -0.51
## black -0.18 1.00 -0.37 0.33
## lstat 0.37 -0.37 1.00 -0.74
## medv -0.51 0.33 -0.74 1.00
# visualize
library(corrplot)
## corrplot 0.84 loaded
corrplot(cor_matrix_r, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
The hypothesis is correct: the correlation coefficient between ‘nox’ and ‘dis’ is -0.77.
For classification and clustering purposes, next step would be to scale the dataset to avoid skewing results.
# scaling around 0
boston_scaled <- scale(Boston)
# getting a matrix as a result
class(boston_scaled)
## [1] "matrix"
# transforming the matrix into a new dataset
boston_scaled <- as.data.frame(boston_scaled)
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
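Under the hood, scale() is just the column-wise z-score (x - mean) / sd, which can be verified directly:

```r
library(MASS)
data("Boston")

boston_scaled <- scale(Boston)

# manual z-score of the 'crim' column should match scale()'s result
crim_z <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
all.equal(as.numeric(boston_scaled[, "crim"]), crim_z)
```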
Now the variables are on the same scale with a mean of 0. In order to work with classification, we need a binary or multiclass target variable. In this instance, we are most interested in classifying neighbourhoods according to crime rates, which is why we will next transform the ‘crim’ variable into a categorical one via quantiles.
# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label = c("low","med_low","med_high","high"))
# look at the table of the new factor crime
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
We are done with data preparation and can now move on to separating the test and training sets for model training and validation.
# determine the number of rows in the dataset
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create a train set
train <- boston_scaled[ind,]
# create a test set from the remaining rows
test <- boston_scaled[-ind,]
All the steps in preparing the data are complete.
Fitting the Linear Discriminant Analysis to the train data:
lda.fit <- lda(crime ~ ., data = train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2574257 0.2500000 0.2500000 0.2425743
##
## Group means:
## zn indus chas nox rm
## low 0.9695518 -0.8986348 -0.0830454025 -0.8756404 0.4878057
## med_low -0.1348835 -0.3134450 0.0005392655 -0.5531867 -0.1342093
## med_high -0.3751653 0.2211521 0.2734075986 0.4479629 0.1463542
## high -0.4872402 1.0171960 -0.0312821145 1.0898970 -0.4303178
## age dis rad tax ptratio
## low -0.8820174 0.7749371 -0.6815029 -0.7621389 -0.525308537
## med_low -0.2765583 0.2778917 -0.5565973 -0.4831296 -0.006900653
## med_high 0.4503409 -0.4079728 -0.3769361 -0.2766939 -0.343039722
## high 0.8216375 -0.8567975 1.6373367 1.5134896 0.779855168
## black lstat medv
## low 0.38267389 -0.78822123 0.58650165
## med_low 0.31265819 -0.14589919 0.01721007
## med_high 0.08389883 0.01674961 0.18385744
## high -0.66159869 0.87963513 -0.67824701
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.08288436 0.712371236 -0.873055592
## indus 0.01371924 -0.253069409 0.026522722
## chas -0.07247237 -0.067707365 0.087303367
## nox 0.42893505 -0.738898977 -1.219089496
## rm -0.07377746 -0.139768243 -0.268140533
## age 0.29610796 -0.383542312 -0.048968773
## dis -0.02538277 -0.386201738 0.151613840
## rad 2.94034144 1.063946739 -0.349412621
## tax 0.05341197 -0.158988334 0.780717136
## ptratio 0.10550874 0.002075486 -0.017374248
## black -0.11256505 -0.006756739 0.026240739
## lstat 0.20540496 -0.206044671 0.367271246
## medv 0.19818226 -0.372006991 -0.003011808
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9411 0.0440 0.0149
The group means demonstrate that the three variables most associated with the high-crime class are the index of accessibility to radial highways (1.64), the full-value property-tax rate per $10,000 (1.51) and, interestingly, the pollution rate (1.09).
The coefficients of linear discriminants confirm those findings for radial highway accessibility, whose LD1 coefficient (2.94) is about three times larger than that of any other variable in the dataset.
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "black", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 3)
The plot confirms the prediction for the biggest influence of accessibility to radial highways.
Next, we will test the model that we have fitted on the test set, which we have previously separated from the data.
# save the correct classes from test data
correct_classes <- test$crime
# remove this variable from test data
test <- dplyr::select(test, -crime)
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 12 11 0 0
## med_low 7 15 3 0
## med_high 0 14 11 0
## high 0 0 0 29
The model looks quite adequate; there is just a small number of mismatches in the classification.
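The overall accuracy can be read straight off such a cross-tabulation as the share of the diagonal. A sketch using the counts from the table above (note the actual counts vary from run to run, since the train/test split is random):

```r
# confusion matrix in the same shape as the LDA cross-tabulation above
conf <- matrix(c(12,  7,  0,  0,
                 11, 15, 14,  0,
                  0,  3, 11,  0,
                  0,  0,  0, 29),
               nrow = 4,
               dimnames = list(correct   = c("low","med_low","med_high","high"),
                               predicted = c("low","med_low","med_high","high")))

# accuracy = correctly classified observations / all observations
accuracy <- sum(diag(conf)) / sum(conf)
accuracy
```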
Next, after assigning the observations to previously determined classes, we will turn to unsupervised methods. K-means clustering begins, as usual, with loading and scaling the data.
# loading
library(MASS)
data("Boston")
# scaling around 0
boston_scaled <- scale(Boston)
# getting a matrix as a result
class(boston_scaled)
## [1] "matrix"
# transforming the matrix into a new dataset
boston_scaled <- as.data.frame(boston_scaled)
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
After the standardisation is complete, we need to calculate the distances for the scaled data.
# euclidean distance matrix
dist_eu <- dist(boston_scaled)
# look at the summary of the distances
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
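dist() defaults to the Euclidean metric, i.e. sqrt(sum((x - y)^2)) for each pair of rows; a small check by hand:

```r
library(MASS)
data("Boston")
boston_scaled <- as.data.frame(scale(Boston))

# distance between the first two observations, by the formula and via dist()
x <- as.numeric(boston_scaled[1, ])
y <- as.numeric(boston_scaled[2, ])
manual <- sqrt(sum((x - y)^2))

d12 <- as.matrix(dist(boston_scaled[1:2, ]))[1, 2]
all.equal(manual, d12)
```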
Next, we will try out a random number of clusters.
# k-means clustering
km <-kmeans(boston_scaled, centers = 10)
# plot the Boston dataset with clusters
pairs(boston_scaled, col = km$cluster)
The plot looks very colourful, but it is obvious that the number of clusters is too big. We will now cut it in half.
# k-means clustering
km <-kmeans(boston_scaled, centers = 5)
# plot the Boston dataset with clusters
pairs(boston_scaled, col = km$cluster)
Looks better, but still confusing. It is time to use more mathematical methods for identifying the right number of clusters for this dataset.
set.seed(123)
# determine the maximum number of clusters
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})
# visualize the results
library(ggplot2)
qplot(x = 1:k_max, y = twcss, geom = 'line')
According to the total within-cluster sum of squares plot, the biggest observable “elbow” is at 2 clusters. Therefore, we will rerun the k-means analysis with 2 centroids.
# k-means clustering
km <-kmeans(boston_scaled, centers = 2)
# plot the Boston dataset with clusters
pairs(boston_scaled, col = km$cluster)
It turns out that the most fitting number of clusters for this dataset was just 2. A bit underwhelming, though.
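One way to see what the two clusters actually capture is to cross-tabulate the cluster assignment against a simple split of the data, for example crime rate above or below its median. A hedged sketch (the split variable is my choice for illustration, not part of the original analysis):

```r
library(MASS)
data("Boston")
boston_scaled <- as.data.frame(scale(Boston))

set.seed(123)
km <- kmeans(boston_scaled, centers = 2)

# compare cluster membership with a high/low crime split
high_crim <- Boston$crim > median(Boston$crim)
table(cluster = km$cluster, high_crim = high_crim)
```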